Example of dimensionality reduction


In [1]:
import pandas as pd
import numpy as np

In [2]:
my_data = pd.DataFrame([1,2,3])

Convert an integer into a binary string


In [8]:
def to_binary(value):
    return "{0:b}".format(value)

In [10]:
to_binary(5)


Out[10]:
'101'

In [12]:
unique_values = my_data.thrid.unique()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-8db6822080d0> in <module>()
----> 1 unique_values = my_data.thrid.unique()

/Library/Python/2.7/site-packages/pandas-0.18.0-py2.7-macosx-10.11-intel.egg/pandas/core/generic.pyc in __getattr__(self, name)
   2667             if name in self._info_axis:
   2668                 return self[name]
-> 2669             return object.__getattribute__(self, name)
   2670 
   2671     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'thrid'

We now apply this to the data. We take the unique values of the column and loop over them with enumerate, which gives both the index position and the value of each one. We then build a dictionary that maps each unique value to the binary string of its index.


In [14]:
my_dict = {}
for index,val in enumerate(unique_vals):
    my_dict[val] = to_binary(index)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-60ee4666a676> in <module>()
      1 my_dict = {}
----> 2 for index,val in enumerate(unique_vals):
      3     my_dict[val] = to_binary(index)
      4 

NameError: name 'unique_vals' is not defined
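The two tracebacks above have simple causes: `my_data` was built from a plain list, so it has no column named `thrid`, and the loop refers to `unique_vals` while the earlier cell assigned `unique_values`. The following is a minimal sketch of the intended steps. It assumes a column named `thrid` filled with hypothetical category labels (these values are not from the original data), and it uses `Series.map` in place of a row-wise `apply`.

```python
import pandas as pd


def to_binary(value):
    """Return the binary representation of a non-negative integer as a string."""
    return "{0:b}".format(value)


# Hypothetical data: one categorical column named "thrid" (values are illustrative only).
my_data = pd.DataFrame({"thrid": ["red", "green", "blue", "green", "red"]})

# Unique values, in order of first appearance.
unique_values = my_data["thrid"].unique()

# Map each unique value to the binary string of its index position.
my_dict = {}
for index, val in enumerate(unique_values):
    my_dict[val] = to_binary(index)

# Attach the binary code of each row's value as a new column.
my_data["thrid_binary"] = my_data["thrid"].map(my_dict)

print(my_dict)
print(my_data)
```

The codes produced this way have different lengths ('0', '1', '10'), which matters as soon as the string is split into one column per digit.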

In [ ]:
# Look up the binary code of each row's value in the mapping
my_data["thrid_binary"] = my_data["thrid"].map(my_dict)

Then you will want to split the binary string into columns, one per digit. You can split the string on nothing, that is, into its individual characters, or you can put a separator between the digits and then split on that separator. Before doing so, you have to look at the histogram of the column and see whether the top n values account for the majority of the rows. If they do, you can use only those values, and you then have a partition for those columns.
Even with only those values, it would still be a better model.
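As a rough sketch of the last two steps, the code below checks how much of a column the top n values cover, groups everything else under a single label, and splits zero-padded binary codes into one column per bit. The data, the choice `top_n = 2`, the `"other"` label, and the `thrid_bit_*` column names are all assumptions made for illustration. Python's `str.split` raises a `ValueError` for an empty separator, so the sketch uses `list(...)` to break a code into its digits.

```python
import pandas as pd

# Hypothetical data: one categorical column named "thrid" (values are illustrative only).
my_data = pd.DataFrame({"thrid": ["a", "b", "a", "c", "a", "b", "d", "e"]})

# Histogram of the column: how many rows each value accounts for, most frequent first.
counts = my_data["thrid"].value_counts()

# Share of rows covered by the top n values (n is an arbitrary choice here).
top_n = 2
coverage = counts.head(top_n).sum() / float(len(my_data))

# Keep the top n values and group every other value under one label.
top_values = counts.head(top_n).index
reduced = my_data["thrid"].where(my_data["thrid"].isin(top_values), "other")

# Zero-pad the binary codes to a fixed width so every row yields the same number of bits.
unique_values = reduced.unique()
width = len("{0:b}".format(len(unique_values) - 1))
mapping = {
    val: "{0:0{1}b}".format(index, width)
    for index, val in enumerate(unique_values)
}
binary_codes = reduced.map(mapping)

# Split each code into its digits: one integer column per bit.
bit_columns = pd.DataFrame(
    [list(code) for code in binary_codes],
    index=my_data.index,
    columns=["thrid_bit_{}".format(i) for i in range(width)],
).astype(int)

print(coverage)
print(bit_columns)
```

With five distinct values reduced to three groups, two bit columns are enough, whereas a one-hot encoding of the original values would need five.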
